Parametric high definition (PHD) speech synthesis-by-analysis: the development of a fundamentally new system creating connected speech by modifying lexically-represented language units
Abstract
Our paper has five sections. In section (1) we critically discuss the fact that the development of Text-to-Speech systems and Speech-to-Text systems has in the past been treated as two entirely separate problems (we restrict ourselves to so-called dictation systems, L2S and S2L, which either translate written language units L into speech signals S, or speech signals S into sequences of written language units L). In section (2) we argue that, for this reason, future theoretical and empirical work should be devoted to an approach that integrates the L2S and S2L components into a unified phonetic system, one that is able to learn to speak a language and also to understand what other L2S systems are saying. The new Munich PHD system is described in section (3) as an example of such a unified approach. Fundamental to this system is the selection and definition of lexically given speech items, both acoustically and articulatorily (EMA). In section (4) we demonstrate a set of prosodic functions that take lexically defined L-inputs and produce phonetically well-formed connected S-outputs. We discuss the possibility of combining certain elementary functions (such as those controlling F0 variation, segment duration, and sound modification) into a much more complex function which also controls the language-specific rhythmic variation of speech tempo in its locally measurable form. Finally, section (5) raises the question of analysing speech data produced by individual speakers as a means of arriving at the sound production system of a generalized representative member of the sociolect or dialect of the language in question.

1 Why Speech Technology has Treated Automatic Speech Recognition and Artificial Speech Synthesis as Two Separate Problems

The main reason for conceiving the Munich PHD Speech-Synthesis-by-Analysis program as a foundationally new approach to phonetic speech research is that in the past, with only a few citable exceptions such as Bridle & Ralls 1985 [1] and Hadersbeck 1988 [2], speech technology has treated the problem of relating speech signals (i.e. measurable data in the form of time functions) and speech categories (i.e. symbolic data in the form of printable characters) in two quite different versions. In the context of developing systems for automatic speech recognition the problem takes the following form: what is given consists of measurable speech signals produced by the speakers of a language; what is asked for is the category of the perceived utterance in terms of the language of the speaker. For the development of speech synthesis systems the question is asked in the opposite direction: what is given is a categorical representation of an utterance of a language, and what is sought is a speech signal that will be perceived by the speakers/listeners of that language as fulfilling all the categories defined by the printable text of that utterance. In so far as any utterance can be related to a readable text in a given language, we will use the term text to represent the category of an utterance. In so far as any real speech utterance conveying the category of such a text must coincide with a measurable time function, we will use the term speech to represent the (digital) speech signal of a given utterance. Any regular speech utterance is thus a phonetic fact, consisting of a measurable speech signal and a related printable category.
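To make these notions concrete, the following minimal Python sketch pairs a printable text category with a measurable signal as a "phonetic fact", and shows how elementary prosodic functions of the kind announced for section (4), one controlling F0 variation and one controlling segment duration, can be composed into a single more complex function that maps a lexically represented L-input to a modified specification for connected S-output. All names (LexicalUnit, PhoneticFact, scale_f0, scale_duration, compose) are hypothetical illustrations, not part of the Munich PHD implementation.

# A minimal sketch, not the described system: a "phonetic fact" pairs a
# printable text category with a measurable speech signal, and connected
# speech is specified by composing elementary prosodic functions over
# lexically defined units.

from dataclasses import dataclass, replace
from typing import Callable, List, Sequence

@dataclass
class LexicalUnit:
    """A lexically given speech item with stored acoustic parameters."""
    label: str             # printable category, e.g. an orthographic word
    duration_ms: float     # stored segment duration
    f0_hz: float           # stored fundamental-frequency target

@dataclass
class PhoneticFact:
    """A regular speech utterance: printable category plus (digital) signal."""
    text: str              # the L-side, a readable text of the utterance
    signal: List[float]    # the S-side, a measurable time function (samples)

# An elementary prosodic function maps a unit sequence to a unit sequence.
ProsodicFunction = Callable[[Sequence[LexicalUnit]], List[LexicalUnit]]

def scale_f0(factor: float) -> ProsodicFunction:
    """Elementary function controlling F0 variation."""
    return lambda units: [replace(u, f0_hz=u.f0_hz * factor) for u in units]

def scale_duration(factor: float) -> ProsodicFunction:
    """Elementary function controlling segment duration (local tempo)."""
    return lambda units: [replace(u, duration_ms=u.duration_ms * factor) for u in units]

def compose(*functions: ProsodicFunction) -> ProsodicFunction:
    """Combine elementary functions into one more complex prosodic function."""
    def combined(units: Sequence[LexicalUnit]) -> List[LexicalUnit]:
        result = list(units)
        for f in functions:
            result = f(result)
        return result
    return combined

# Example: a complex function that lowers F0 slightly and slows the local
# tempo, applied to a lexically represented L-input before signal generation.
complex_prosody = compose(scale_f0(0.95), scale_duration(1.2))
l_input = [LexicalUnit("guten", 310.0, 120.0), LexicalUnit("Morgen", 420.0, 110.0)]
s_spec = complex_prosody(l_input)   # modified units, ready for synthesis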
From a purely technical point of view it makes good sense to handle speech recognition and speech synthesis as two separate problems, referred to as Speech-to-Text (S2T) and Text-to-Speech (T2S). One reason is surely that S2T is much more of a bottom-up problem than T2S is; conversely, top-down components play a much greater role in T2S than they do in S2T. In S2T we have to define (and then also to detect) necessary conditions for deciding which words of the language could have been produced in a given utterance (provided it is a clear, regular utterance of that language), whereas in T2S it is often enough to define sufficient conditions for creating a speech signal that fulfils only some of the conditions needed by the listeners to perceive the required text category.

2 Two Main Goals and the Distinction Between Two Types of Speech Acts

Our decision to create a new research system at the University of Munich was motivated quite programmatically. The first programmatic component of this decision was to combine two different goals, a more practically oriented and a more theoretically oriented one. The second component consisted in introducing a clear distinction between two separate sets of speech acts.
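The unified treatment of the two directions argued for above can be pictured as a single system that owns one inventory of lexically analysed items and exposes both T2S and S2T over it. The sketch below is a deliberately naive illustration of that architectural idea, not the actual Munich PHD design; the class and method names (UnifiedPhoneticSystem, learn, speak, understand) are hypothetical, and the "recognition" step is reduced to exact template matching purely to keep the example self-contained.

# A minimal sketch of a unified phonetic system: one shared inventory of
# analysed lexical items, used both for producing speech (L2S/T2S) and for
# recognising it (S2L/S2T).

from typing import Dict, List

class UnifiedPhoneticSystem:
    def __init__(self) -> None:
        # Shared knowledge: lexical label -> stored signal template (samples).
        self.inventory: Dict[str, List[float]] = {}

    def learn(self, label: str, signal: List[float]) -> None:
        """Analysis step: store a lexically defined item with its signal."""
        self.inventory[label] = list(signal)

    def speak(self, text: List[str]) -> List[float]:
        """T2S: concatenate stored templates for a lexically represented input."""
        signal: List[float] = []
        for label in text:
            signal.extend(self.inventory[label])
        return signal

    def understand(self, signal: List[float]) -> List[str]:
        """S2T: greedy longest-match of the signal against the same inventory."""
        labels: List[str] = []
        pos = 0
        while pos < len(signal):
            match = None
            for label, template in sorted(self.inventory.items(),
                                          key=lambda kv: len(kv[1]),
                                          reverse=True):
                n = len(template)
                if n > 0 and signal[pos:pos + n] == template:
                    match = (label, n)
                    break
            if match is None:
                break  # unanalysable stretch: a real system would hypothesise
            labels.append(match[0])
            pos += match[1]
        return labels

# Usage: let the system analyse two items, then run both directions.
system = UnifiedPhoneticSystem()
system.learn("guten", [0.1, 0.2, 0.1])
system.learn("Morgen", [0.3, 0.0, -0.2, 0.1])
s = system.speak(["guten", "Morgen"])   # T2S
l = system.understand(s)                # S2T -> ["guten", "Morgen"]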